Introduction

This document summarizes the analysis and findings centered around craft breweries and beers. The data contains 2000+ beers and 500+ breweries. The purpose of this analysis is to determine whether there are any findings that would be beneficial to the Budwiser enterprise, provide leadership with data driven conclusions to make informed decisions, and/or allow for oversight on consumer preferences.

We appreciate you taking the time to review the information below. Please contact Inderbir Dhillon (Project Lead), Nick Blair and/or Jamie Vo if you have any questions.

Analysis

Load in all dependencies (libraries).

# load in the libraries
library(stringi)
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(plyr)
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:plotly':
## 
##     arrange, mutate, rename, summarise
library(class)
library(caret)
## Loading required package: lattice
library(e1071)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ✓ purrr   0.3.3
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange(), plotly::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks plotly::filter(), stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x purrr::lift()      masks caret::lift()
## x dplyr::mutate()    masks plyr::mutate(), plotly::mutate()
## x dplyr::rename()    masks plyr::rename(), plotly::rename()
## x dplyr::summarise() masks plyr::summarise(), plotly::summarise()
## x dplyr::summarize() masks plyr::summarize()

Read in the data

Read in data from files from csv files and display the data frames.

beer = read.csv("./data/Beers.csv",header = TRUE)
breweries = read.csv("./data/Breweries.csv",header = TRUE, strip.white = TRUE)

#display the dataframes
beer
breweries

How many breweries are there in each state?

Count the number of breweries in each state and check for unique breweries in each state.

str(breweries) # check that State is a Factor
## 'data.frame':    558 obs. of  4 variables:
##  $ Brew_ID: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name   : Factor w/ 551 levels "10 Barrel Brewing Company",..: 355 12 266 319 201 136 227 477 59 491 ...
##  $ City   : Factor w/ 384 levels "Abingdon","Abita Springs",..: 228 200 122 299 300 62 91 48 152 136 ...
##  $ State  : Factor w/ 51 levels "AK","AL","AR",..: 24 18 20 5 5 41 6 23 23 23 ...
state_breweies <- breweries %>% group_by(State) %>% tally() # count the number of breweries within a state
unique_state_breweies <- breweries %>% group_by(State) %>% tally(n_distinct(Name)) # check for any duplicates

Heat Map of Breweies per State

Create a heat map of the breweries per state to visualize density.

## add lowercase state name for heat map
lstates = tolower(state.name)
state_count = state_breweies %>% 
    add_rownames("region") %>% 
    mutate( region=lstates[match(State, state.abb)] )
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## find center of each state for text position 
snames <- data.frame(region=lstates, long=state.center$x, lat=state.center$y)
snames <- merge(snames, state_count, by="region")

## merge map data with counts data
choro <- left_join(
  map_data("state"), 
  state_count
)
## Joining, by = "region"
ggplot(choro, aes(long, lat)) +
  geom_polygon(aes(group = group, fill = n)) + 
  geom_text(data=snames, aes(long, lat, label=n)) +
  ggtitle("Breweries per State")

  coord_quickmap()
## <ggproto object: Class CoordQuickmap, CoordCartesian, Coord, gg>
##     aspect: function
##     backtransform_range: function
##     clip: on
##     default: FALSE
##     distance: function
##     expand: TRUE
##     is_free: function
##     is_linear: function
##     labels: function
##     limits: list
##     modify_scales: function
##     range: function
##     render_axis_h: function
##     render_axis_v: function
##     render_bg: function
##     render_fg: function
##     setup_data: function
##     setup_layout: function
##     setup_panel_params: function
##     setup_params: function
##     transform: function
##     super:  <ggproto object: Class CoordQuickmap, CoordCartesian, Coord, gg>

The range of breweries per state ranges from 1 to 47, with Colorado holding the highest number of breweries per state. The heat map above shows North and South Dakotas along with West Virginia as having the lowest number of breweries (1 per state). Following Colorado is California with 39 and Oregon with 29 breweries. There appears to be higher number of breweries in the north east and west coast, as opposed to the central states, with the exception of Colorado and Texas.

The high density of breweries may indicate greater competition, as well as a large audience for beer.

Cleaning the data

Prepering/transforming the data into a usable form for analysis, visualization, etc… #### Merging Dataframes Merge beer data with the breweries data. Print the first 6 observations and the last 6 observations to check the merged file. The two data sets were merged using the Brewer_id value. The first and last 6 rows of data were shown to spot-check accuracy.

attach(beer)
beer[order(Brewery_id),] # sort the data to determine column for merge
# merge on Brewery ID
breweries_named <- plyr::rename(breweries, c("Brew_ID"="Brewery_id"))

brewing_beer <- merge(breweries_named,beer,by="Brewery_id", all=TRUE) # outter join

brewed_beer <- plyr::rename(brewing_beer, c("Name.x"="Brewery", "Name.y"="Beer")) # rename breweries and beer

head(brewed_beer,6) # show the first 6 rows of data
tail(brewed_beer,6) # show the last 6 rows of data

Missing Data

Missing data are in columns ABV (62) and IBU (1005) only. Cleaning data in multiple options: 1. complete records only 2. replacing NA with the averages of the remainder of the column

# method 1: electing only complete cases
which(is.na(brewed_beer)) # determine which rows contain NA
##    [1] 14521 14531 14653 14656 14689 14712 14898 14899 14968 14969 14970 14971
##   [13] 15028 15029 15030 15031 15032 15033 15034 15087 15249 15251 15254 15360
##   [25] 15450 15455 15457 15462 15614 15761 15885 15939 16091 16354 16357 16380
##   [37] 16382 16410 16555 16637 16638 16641 16687 16694 16702 16719 16720 16721
##   [49] 16722 16756 16804 16811 16812 16813 16814 16815 16816 16817 16825 16827
##   [61] 16854 16855 16889 16905 16906 16907 16908 16909 16910 16911 16912 16913
##   [73] 16915 16916 16917 16918 16919 16920 16921 16923 16925 16926 16927 16928
##   [85] 16929 16930 16931 16932 16933 16934 16935 16936 16937 16938 16939 16940
##   [97] 16941 16942 16943 16944 16945 16946 16947 16948 16949 16950 16951 16952
##  [109] 16953 16954 16955 16956 16957 16958 16959 16960 16961 16962 16963 16964
##  [121] 16965 16966 16967 16968 16969 16970 16971 16972 16973 16974 16975 16976
##  [133] 16977 16978 16979 16980 16981 16982 16983 16984 16985 16986 16987 16992
##  [145] 16994 16997 16998 17004 17009 17012 17022 17028 17030 17031 17033 17036
##  [157] 17038 17039 17040 17043 17057 17063 17066 17070 17076 17078 17084 17085
##  [169] 17087 17092 17093 17099 17107 17110 17111 17113 17122 17123 17124 17125
##  [181] 17126 17127 17129 17132 17135 17137 17139 17140 17141 17142 17143 17145
##  [193] 17149 17152 17157 17159 17160 17163 17164 17165 17167 17168 17169 17170
##  [205] 17171 17172 17173 17174 17175 17176 17177 17178 17179 17185 17192 17202
##  [217] 17224 17230 17231 17232 17233 17234 17236 17239 17240 17242 17243 17246
##  [229] 17247 17248 17252 17254 17257 17258 17259 17261 17262 17263 17264 17267
##  [241] 17269 17273 17280 17281 17282 17283 17284 17285 17286 17287 17288 17298
##  [253] 17299 17307 17308 17309 17312 17315 17316 17317 17318 17319 17321 17322
##  [265] 17323 17324 17327 17329 17334 17335 17338 17347 17348 17352 17355 17358
##  [277] 17362 17363 17371 17372 17373 17374 17375 17376 17377 17378 17379 17380
##  [289] 17381 17403 17408 17409 17410 17411 17412 17413 17415 17417 17427 17432
##  [301] 17433 17437 17438 17439 17440 17441 17442 17443 17444 17445 17447 17455
##  [313] 17456 17463 17465 17466 17467 17468 17469 17470 17480 17482 17483 17484
##  [325] 17485 17488 17491 17496 17497 17498 17499 17504 17505 17506 17517 17519
##  [337] 17520 17521 17524 17527 17528 17529 17531 17532 17533 17534 17552 17553
##  [349] 17570 17576 17583 17584 17588 17593 17600 17601 17602 17616 17618 17620
##  [361] 17621 17626 17632 17636 17645 17651 17659 17660 17661 17664 17665 17675
##  [373] 17676 17677 17678 17679 17680 17681 17682 17683 17684 17685 17686 17687
##  [385] 17695 17696 17698 17699 17700 17701 17706 17707 17712 17715 17717 17718
##  [397] 17721 17724 17727 17731 17732 17733 17745 17746 17747 17748 17749 17750
##  [409] 17751 17752 17753 17754 17755 17758 17760 17761 17765 17770 17771 17781
##  [421] 17782 17783 17784 17791 17802 17803 17804 17816 17818 17822 17824 17825
##  [433] 17826 17828 17830 17833 17834 17837 17838 17860 17861 17863 17864 17865
##  [445] 17867 17871 17872 17873 17879 17881 17885 17887 17892 17893 17894 17897
##  [457] 17899 17902 17903 17905 17914 17920 17921 17924 17925 17926 17927 17928
##  [469] 17929 17930 17931 17932 17933 17934 17935 17936 17937 17938 17939 17940
##  [481] 17941 17942 17943 17944 17945 17946 17948 17949 17951 17952 17953 17954
##  [493] 17955 17956 17957 17958 17959 17960 17961 17962 17963 17964 17965 17967
##  [505] 17974 17988 17989 17990 17991 17992 17993 17998 18002 18008 18017 18021
##  [517] 18022 18024 18025 18026 18029 18030 18031 18033 18049 18050 18051 18052
##  [529] 18053 18055 18056 18058 18059 18068 18069 18070 18071 18072 18073 18074
##  [541] 18077 18078 18079 18080 18081 18083 18084 18085 18086 18087 18088 18091
##  [553] 18100 18101 18102 18103 18104 18107 18108 18109 18110 18111 18112 18113
##  [565] 18124 18125 18131 18137 18157 18161 18162 18163 18167 18168 18169 18170
##  [577] 18171 18172 18173 18180 18185 18197 18198 18199 18200 18201 18205 18211
##  [589] 18219 18232 18240 18243 18246 18247 18248 18251 18252 18255 18256 18257
##  [601] 18258 18259 18260 18261 18262 18263 18264 18266 18267 18268 18269 18274
##  [613] 18281 18283 18284 18285 18286 18287 18288 18291 18292 18293 18295 18296
##  [625] 18299 18301 18303 18313 18316 18317 18318 18319 18320 18321 18326 18338
##  [637] 18339 18340 18341 18347 18348 18349 18350 18358 18363 18371 18373 18377
##  [649] 18378 18380 18381 18383 18385 18396 18397 18411 18412 18414 18416 18417
##  [661] 18419 18420 18421 18422 18423 18425 18429 18433 18434 18435 18436 18437
##  [673] 18438 18441 18442 18444 18445 18446 18447 18449 18450 18455 18461 18463
##  [685] 18467 18469 18470 18472 18473 18474 18475 18477 18479 18480 18483 18484
##  [697] 18496 18497 18498 18499 18500 18501 18502 18503 18504 18505 18506 18509
##  [709] 18510 18520 18521 18524 18525 18531 18532 18533 18534 18535 18536 18537
##  [721] 18538 18544 18545 18546 18547 18548 18549 18550 18553 18554 18555 18562
##  [733] 18576 18577 18578 18581 18582 18597 18598 18599 18605 18607 18608 18609
##  [745] 18610 18611 18612 18614 18615 18618 18620 18623 18624 18625 18632 18634
##  [757] 18639 18643 18649 18650 18651 18652 18653 18654 18655 18660 18666 18667
##  [769] 18674 18675 18678 18689 18695 18700 18702 18703 18704 18705 18707 18709
##  [781] 18710 18711 18723 18724 18725 18731 18732 18735 18739 18743 18746 18747
##  [793] 18748 18752 18753 18756 18757 18758 18759 18761 18762 18763 18764 18766
##  [805] 18767 18770 18780 18783 18784 18786 18787 18788 18789 18790 18791 18792
##  [817] 18793 18794 18795 18797 18799 18800 18801 18802 18809 18810 18811 18812
##  [829] 18813 18818 18819 18820 18822 18824 18829 18830 18831 18832 18833 18836
##  [841] 18843 18844 18845 18846 18847 18848 18850 18851 18853 18854 18855 18861
##  [853] 18862 18863 18864 18865 18866 18867 18869 18871 18872 18873 18874 18876
##  [865] 18877 18878 18879 18880 18884 18885 18886 18887 18888 18892 18893 18902
##  [877] 18903 18904 18905 18906 18910 18911 18912 18913 18914 18915 18917 18919
##  [889] 18920 18923 18925 18933 18934 18935 18936 18941 18942 18943 18944 18945
##  [901] 18946 18947 18948 18952 18954 18955 18956 18957 18958 18959 18964 18965
##  [913] 18966 18973 18975 18976 18982 18983 18984 18985 18995 19003 19004 19005
##  [925] 19006 19007 19013 19014 19015 19016 19017 19022 19024 19025 19026 19027
##  [937] 19030 19032 19033 19038 19039 19040 19041 19042 19045 19047 19048 19049
##  [949] 19051 19052 19053 19054 19055 19056 19057 19066 19067 19068 19070 19074
##  [961] 19075 19076 19083 19085 19086 19087 19097 19098 19104 19108 19109 19112
##  [973] 19115 19116 19122 19123 19124 19129 19130 19131 19132 19135 19136 19137
##  [985] 19138 19140 19141 19142 19144 19156 19157 19158 19160 19161 19162 19163
##  [997] 19164 19165 19166 19175 19176 19177 19178 19179 19180 19182 19186 19187
## [1009] 19193 19194 19195 19196 19197 19200 19201 19202 19203 19212 19213 19214
## [1021] 19215 19219 19221 19222 19223 19224 19225 19226 19227 19231 19232 19233
## [1033] 19234 19235 19236 19237 19238 19239 19240 19241 19242 19243 19244 19245
## [1045] 19248 19249 19252 19253 19262 19263 19264 19265 19266 19267 19268 19269
## [1057] 19270 19271 19272 19273 19274 19275 19276 19277 19278 19279 19280
colSums(is.na(brewed_beer)) # summary of the number of NA in each column
## Brewery_id    Brewery       City      State       Beer    Beer_ID        ABV 
##          0          0          0          0          0          0         62 
##        IBU      Style     Ounces 
##       1005          0          0
completed_beer <- brewed_beer[complete.cases(brewed_beer), ] # df with only complete records

# method 2: replacing NA with averages
averaged_beer <- brewed_beer # make a duplicate of the original df to manipulate
averageABV <- mean(averaged_beer$ABV, na.rm = TRUE) # set average variable for ABV
averageIBU <- mean(averaged_beer$IBU, na.rm = TRUE) # set average variable for IBU

#replace NA values with the average of the present values
averaged_beer$IBU[is.na(averaged_beer$IBU)]<-averageIBU
averaged_beer$ABV[is.na(averaged_beer$ABV)]<-averageABV


colSums(is.na(averaged_beer))
## Brewery_id    Brewery       City      State       Beer    Beer_ID        ABV 
##          0          0          0          0          0          0          0 
##        IBU      Style     Ounces 
##          0          0          0

The areas missing data are confined to the alcohol by volume (ABV) and international bitterness units (IBU) data values. The missing ABV may be explained due by state laws that do or do not regulate the labelling of alcohol content on beers. Breweries in states that do not require the ABV to be labelled may not have this data. The lack of IBU data is likely due to bitterness testing requiring lab equipment, personelle, and addiitonal expenses that are not required for the manufacturing of beer.

The data was mutated in two ways: method 1. which only contains data where all testing variables were complete and method 2. where the missing data were replaced with the averages of the available data. It was determined that when removing rows with missing data, the primary difference was with the lower ABV, which is originally 0.1% rather than 2.7% with missing values removed

Median Alcohol Content

Median of Alcohol by Volume and Bitterness by State

# group by state, get median of ABV, IBU
medians <- as.data.frame(aggregate(completed_beer[,c(7,8)], by=list(completed_beer$State), FUN=median)) 
median_df <- plyr::rename(medians, c("Group.1"="State")) # rename the column to State

median_graph <- median_df %>% ggplot(aes(x = ABV, y = IBU, color=State)) + geom_point() + ggtitle("Median Alcohol Content and Bitterness by State") # plot scatter plot

ABV_bar <-ggplot(data=median_df, aes(x = State, y = ABV, fill = State)) +
  geom_bar(stat="identity", width = 0.75) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Median ABV by State")
ggplotly(ABV_bar)
IBU_bar <-ggplot(data=median_df, aes(x = State, y = IBU, fill = State)) +
  geom_bar(stat="identity", width = 0.75) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Median IBU by State")
ggplotly(IBU_bar)
mean(median_df$ABV)
## [1] 0.05564
mean(median_df$IBU)
## [1] 36.98

We use median as our measure of center to prevent outliers from skewing the data and causing errors to interpretation

In comparing the median ABV, Arizona and Texas have the lowest alcohol by volume at 4.00%. Maine and Colorado have the two highest alcohol by volumes at 6.7% and 6.5%, respectively. While there is no technical maximum on the bitterness scale, the majority of beers are typically below 100 IBU, with pale ales near 40 IBU and wheats near 13 IBU. While some beers are capable of breaking 100 IBU, the majority are shown to be between 20 and 40 IBU, with Wisconsin having the lowest score of 19 IBU and Maine having the highest score of 61 IBU. The average median across all 50 states is 5.6% ABV and 36.98 IBU.

An interesting note to make is Maine having the highest median alcohol content accompanied by the highest median bitterness score.

Max ABV and IBU

Which state has the maximum alcoholic (ABV) beer? Which state has the most bitter (IBU) beer?

# find the max ABV and IBU from each state
maximums <- as.data.frame(aggregate(completed_beer[,c(7,8)], by=list(completed_beer$State), FUN=max))
max_df <- plyr::rename(maximums, c("Group.1"="State")) # rename the column to State

#find the max ABV and IBU states
max_ABV <- max_df %>% filter(ABV == max(ABV))
max_IBU <- max_df %>% filter(IBU == max(IBU))

max_state <- rbind(max_ABV, max_IBU)
max_state

Reviewing the original data rather than the medians of each state, Kentucky had the highest ABV of all states, with 12.5% alcohol by volume. Oregon has the highest bitterness score of 138 ibu. We can see that both the ABV and IBU of the two beers are above the average medians of the states (5.6% and 39.98 ibu, respectively).

This leads us to question whether there is a correlation with alcohol by volume and bitterness.

Summary Statistics

The summary statistics and distribution of the ABV variable.

#Creates column representing ABV in percentage which is more user-readable and in-line with how
#ABV is represented by the industry
print("Values below are in percentage (%)")
## [1] "Values below are in percentage (%)"
#Using method 1 for handling missing data
completed_beer$ABVpercent <- completed_beer$ABV*100 

summary(completed_beer$ABVpercent) #Range, quartiles, and mean of ABV percentage value
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.700   5.000   5.700   5.991   6.800  12.500
cat("Standard deviation: ", sd(completed_beer$ABVpercent)) #Standard deviation of ABV percentage value
## Standard deviation:  1.357633
#Using method 2 for handling missing data
averaged_beer$ABVpercent <- averaged_beer$ABV*100

summary(averaged_beer$ABVpercent)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.100   5.000   5.700   5.977   6.700  12.800
cat("Standard deviation: ", sd(averaged_beer$ABVpercent))
## Standard deviation:  1.336634

Summary statistics provided for ABV (%) are the data range, quartiles, mean, and standard deviation. These values differ slightly depending on which correction method is used to handle missing data in the set.

Using method 1 (ommitting non-complete records): Min: 2.7% ABV 1st: 5.0% ABV Median: 5.7% ABV Mean 5.991% ABV 3rd: 6.8% ABV Max: 12.5% ABV SD: 1.358 %ABV

Using method 2 (replacing NA values with averages or the remaining column data): Min: 0.100% 1st: 5.0% Median: 5.7% Mean: 5.977% 3rd: 6.7% Max: 12.8% SD: 1.337%

Relationship between Bitterness and Alcohol Content

Is there an apparent relationship between the bitterness of the beer and its alcoholic content? A scatter plot is used for visualization.

#Uses method 1 for handling missing data (remove incomplete lines)

#Correlation between ABV percentage and IBU; function uses Pearson method as default
cor(x=completed_beer$ABVpercent, y=completed_beer$IBU)
## [1] 0.6706215
#Visualizations between ABV percentage and IBU
ggplot(data=completed_beer, mapping=aes(x= IBU, y= ABVpercent, position_jitter())) + geom_point()

# filter down to the top 12 most popular beers
popular_styles = completed_beer %>% 
  group_by(Style) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count)) %>%
  head(12)

top_completed_beer = completed_beer %>% filter(Style %in% popular_styles$Style)

print(nrow(top_completed_beer) / nrow(completed_beer))
## [1] 0.6498221
ggplot(data=top_completed_beer, mapping=aes(x= ABVpercent, y= IBU, position_jitter())) +
  geom_point() +
  facet_wrap(~Style)

n_styles = nrow(popular_styles)
popular_styles$ABV_IBU_corr = numeric(n_styles)
for(i in 1:n_styles)  #n_styles)
{
  style = popular_styles[i, 1]$Style
  style_beer = top_completed_beer %>% filter(Style == style)
  popular_styles$ABV_IBU_corr[i] = cor(x=style_beer$ABVpercent, y=style_beer$IBU)
}
popular_styles
#Uses method 2 for handling missing data (replace NAs with average of present data)

#Correlation between ABV percentage and IBU; function uses Pearson method as default
cor(x=averaged_beer$ABVpercent, y=averaged_beer$IBU)
## [1] 0.520011
#Visualizations between ABV percentage and IBU
ggplot(data=averaged_beer, mapping=aes(x= IBU, y= ABVpercent, position_jitter())) + geom_point()

Looking at the scatter plot of ABV vs IBU, we see that there is a reasonably strong linear trend between ABV and IBU where higher ABVs are associated with higher IBUs. This is backed up when we see a reasonably stong correlation of 0.671 (using missing data correction method 1 which omitted non-complete records - method 2 of replacing NA values with the average of the present data yielded a similar but lower correlation of 0.520).

Going furthar, we looked at the 12 most popular beer types (which represent 65% of the beers sampled) and looked at the correlation between IBU and ABV. Interestingly, the association between ABV and IBU was not strong within the styles of beer. All the correlations were less than the 0.671 we saw overall. Some (such as American Poter, and American Double/Imperial IPA) had low correlations and scatter plots that looked random.

It seems that some styles of beer have higher IBUs and that those styles will also have higher ABVs, however within a group there is IBUs and ABU aren’t as strongly tied. This means that when releasing a new beer, it might be more important to make sure the ABU and IBUs are in line with other beers of the style than trying to match some ratio of IBU and ABV.

IPAs vs. Ales

Investigation between the difference of IBU to ABV between IPAs and Ales.

We used KNN (K-nearest neighbor) to classify the relationship, which classifies a beer based on its nearest k-number of beers. (e.g. if n is the beer in question and 4 of 5 of the nearest beers are higher in IBU, we would classify n to have higher IBU as well).

## create a df for only IPAs and other Ales
aleSplit <- function(beers) {
  ipa = beers[grep("IPA", beers$Style), ]
  ipa$Style = beers[grep("IPA", beers$Style), ]
  ipa$Style = "IPA"
  
  not_ipa = beers[-grep("IPA", beers$Style), ]
  ales = not_ipa[grep("Ale", not_ipa$Style), ]
  ales$Style = "Ale"
  
  allAles = rbind(ipa, ales)
  return(allAles)
}

## Create kNN for k from 1 to 30 and return accuracy df
classifyKnn <- function(ales) {
  # 60/40 train/test split
  set.seed(0)
  splitPerc = .60
  trainIndices = sample(1:dim(ales)[1], round(splitPerc * dim(ales)[1]))
  train = ales[trainIndices,]
  test = ales[-trainIndices,]
  trainLabels = train$Style
  testLabels = test$Style
  train = subset(train, select = c(ABV, IBU))
  test = subset(test, select = c(ABV, IBU))
  
  # Scale the features 
  train = as.data.frame(scale(train))
  test = as.data.frame(scale(test))
  
  # get accuracy score for kNN for k=1 to k=30
  neighbors = 30
  accs = data.frame(accuracy = numeric(neighbors), k = numeric(neighbors))
  for(i in 1:neighbors)
  {
    classifications = knn(train, test, trainLabels, k = i)
    table(testLabels, classifications)
    CM = confusionMatrix(table(testLabels, classifications))
    accs$accuracy[i] = CM$overall[1]
    accs$k[i] = i
  }
  
  return(accs)
}


completedAle = aleSplit(completed_beer)
avgAle = aleSplit(averaged_beer)

## Looks at the graphs of IBU and ABV
ggplot(data=completedAle, mapping=aes(x = Style, y = IBU)) + geom_boxplot() + 
  ggtitle("IBU of IPAs vs Other Ales")

ggplot(data=completedAle, mapping=aes(x = Style, y = ABVpercent)) + 
  geom_boxplot() +
  ylab('ABV')

  ggtitle("ABV of IPAs vs Other Ales")
## $title
## [1] "ABV of IPAs vs Other Ales"
## 
## attr(,"class")
## [1] "labels"
ggplot(data=completedAle, mapping=aes(x=ABVpercent, y=IBU, color=Style, position_jitter())) + geom_point() + xlab('% ABV') + ggtitle('ABV and IBU of IPAs vs other Ales')

complete_accs = classifyKnn(completedAle)
avg_accs = classifyKnn(avgAle)

## Look at accuracy plots to find best k to report the scores
plot(complete_accs$k,complete_accs$accuracy, type = "l", xlab = "k", ylab="Accuracy")

complete_accs[15,]
plot(avg_accs$k,avg_accs$accuracy, type = "l", xlab = "k", ylab="Accuracy")

avg_accs[5,]

Now we look at boxplots of IBU and ABV for IPAs and other Ales. We can see that IPAs have a higher ABV and IBU than other ales. Looking at a scatter plot broken down by color, the distinction between the two style is pretty clear with the IPAs strongly clustered in the high ABV/high IBU part of the chart and the other ales more centered at the low ABV/low IBU part of the chart.

To further explor this, we built attempted to classify the beers as IPA or other Ale based on only the IBU and ABV information using an algroithm called kNN and found very promising results. With the type 1 correction of missing values (leaving them out), we were able to accuractely classify 88% of ales as IPA or other based only on IBU and ABV. With the type 2 correction, this accuracy was 83%.

IBUs of West Coast IPAs

## add features for if the beer is an IPA or if the brewer is from the west coast
westCoast = c('CA', 'OR', 'WA')
completed_beer$WestCoast <- ifelse(completed_beer$State %in% westCoast, "West Coast", "Other States")
completed_beer$IPA <- ifelse(grepl("IPA", completed_beer$Style), "IPA", "Other")

ipa = completed_beer %>% filter(IPA=='IPA')
state_ipa <- ipa %>% group_by(State) %>% tally() # count the number of breweries within a state


### Map of IPAs per state ###
## add lowercase state name for heat map
lstates = tolower(state.name)
state_count = state_ipa %>% 
    add_rownames("region") %>% 
    mutate( region=lstates[match(State, state.abb)] )
## Warning: Deprecated, use tibble::rownames_to_column() instead.
## find center of each state for text position 
snames <- data.frame(region=lstates, long=state.center$x, lat=state.center$y)
snames <- merge(snames, state_count, by="region")

## merge map data with counts data
choro <- left_join(
  map_data("state"), 
  state_count
)
## Joining, by = "region"
ggplot(choro, aes(long, lat)) +
  geom_polygon(aes(group = group, fill = n)) + 
  geom_text(data=snames, aes(long, lat, label=n)) +
  coord_quickmap() +
  ggtitle("IPAs per State")

### end map ###

## Chart of IPA perentage in West Coast vs other regions
ggplot(data=completed_beer, aes(WestCoast)) +
  geom_bar(aes(fill=as.factor(IPA)), position="fill") +
  xlab('Region') +
  labs(fill='Beer Style') 

ggplot(data=ipa, mapping=aes(x = WestCoast, y = IBU)) + geom_boxplot() + 
  ggtitle("IBU of West Coast IPAs vs other IPAs") +
  xlab('Region')

## Density plots of IBU in West Coast vs other regions
ggplot(ipa, aes(IBU)) + 
  geom_density(aes(fill=WestCoast), alpha=0.8)+ 
  labs(title="IBUs for West Coast vs Other IPAs", fill="Region")

t.test(ipa$IBU~ipa$WestCoast)
## 
##  Welch Two Sample t-test
## 
## data:  ipa$IBU by ipa$WestCoast
## t = -0.56854, df = 158.16, p-value = 0.5705
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.762931  3.186699
## sample estimates:
## mean in group Other States   mean in group West Coast 
##                   71.64667                   72.93478

Finally, we wanted to explore IBU of IPAs in the West coast. The West Coast IPA is a category of IPA that is popular along the pacific coast and features a heavy use of hops, which is usually paired with a high IBU. West Coast IPAs are not specifically labelled in this dataset so we wanted to see if there was a difference in IPAs in the West Coast in general.

Looking at the heat maps of IPAs per state we see that the West coast does seem to be the main hub for IPAs, even accounting for the increased number of breweries in the West coast states. In fact, a chart of the percentage of beers that were IPAs does show that the West Coast has a higher percentage of IPAs.

Having established the abundance of IPA in the West coast, the next issue is the IBU. Box plots for the West coast and other reagions show that the average IBU is about the same. The chart of the distributions, however, does show an interesting pattern. The graphs mostly overlap, but there is a hump where the West coast states stand out at about the 80-100 IBU range. What this means is that there are more beers in the 80-100 IBU range in the West Coast.

From a business prespective, this means that the averge IPA in the west coast is about as bitter as any other region. However, there does seem to be a elevated number of IPAs in the 80-100 IBU range in the West coast. This range seems to be in line with the style of West Coast IPA. If you wanted to target a niche beer towards West Coast IPA fans, this would likely be the IBU range to aim for.

Conclusion

We discovered that Colorado and California are the states with highest numbers of breweries (47 and 39 respectively). If we remove one of the two same breweries in Colorado, that number drops to 46. In addition to duplicates effecting the data, removing records that had missing values resulted in the minimum ABV to go from 0.1% to 2.7%. Since no other siginificant changes to the data summaries were observed, analysis continued with the removal of missing data.

The median ABV and IBU were found to be 5.6% and 39.98, and the max to be 12.5% (Kentucky) and 138 (Oregon), respectively. Interestingly, we found a strong, positive linear relationship between ABV and IBU. After additional analysis, we can see that IPAs have a higher ABV and IBU than other ales, and by using a KNN algorithim, accuractely predict if its an IPA with 83% accuracy. In continuation of the Ale study, there is evidence to suggest that IPAs with higher IBUs are most prominent on the west coast.

Additional data for analysis is recommended to support the findings found in this analysis. Regardless, it is interesting to observe high densities of breweries in Colorado and California. In addiiton, the data may suggest that higher ABV and IBU than average may flourish on the west coast (California) considering its prominence in the area.